Abstract:

The 4,639,221-base pair sequence of Escherichia coli K-12 is presented. Of 4288 protein-coding genes annotated, 38 percent 
have no attributed function. Comparison with five other sequenced microbes reveals ubiquitous as well as narrowly distributed 
gene families; many families of similar genes within E. coli are also evident. The largest family of paralogous proteins contains 
80 ABC transporters. The genome as a whole is strikingly organized with respect to the local direction of replication; guanines, 
oligonucleotides possibly related to replication and recombination, and most genes are so oriented. The genome also contains 
insertion sequence (IS) elements, phage remnants, and many other patches of unusual composition indicating genome plasticity 
through horizontal transfer. 

Because of its extraordinary position as a preferred model in biochemical genetics, molecular biology, and 
biotechnology, E. coli K-12 was the earliest organism to be suggested as a candidate for whole genome 
sequencing (1, 2). The availability of the complete sequence of E. coli should stimulate further research 
toward a more complete understanding of this important experimental, medical, and industrial organism. 
Since the inception of the E. coli project, six other complete genomes have become publicly available (3). 
Genome sequences, especially those of well-studied experimental organisms, help to integrate a vast 
resource of biological knowledge and serve as a guide for further experimentation. Availability of the 
complete set of genes also enables global approaches to biological function in living cells (4) and has led to 
new ways of looking at the evolutionary history of bacteria (5). 

Escherichia coli is an important component of the biosphere. It colonizes the lower gut of animals, and, as 
a facultative anaerobe, survives when released to the natural environment, allowing widespread 
dissemination to new hosts (6). Pathogenic E. coli strains are responsible for infections of the enteric, 
urinary, pulmonary, and nervous systems. We chose strain MG1655 as the representative to sequence 
because it has been maintained as a laboratory strain with minimal genetic manipulation, having only been 
cured of the temperate bacteriophage lambda and F plasmid by ultraviolet light and acridine orange, 
respectively (7). We now know that these treatments resulted in a frameshift mutation at the end of rph, 
causing low expression of the downstream gene pyrE and, in turn, a pyrimidine starvation phenotype (8). In 
addition, a mutation in ilvG disrupts one of the isoleucine-valine biosynthesis pathways in all K-12 isolates 
(9). Finally, almost all K-12 derivatives, including MG1655, carry the rfb-50 mutation, where an IS5 
insertion results in the absence of O-antigen synthesis in the lipopolysaccharide (10). It will be interesting 
[...] experimental treatments and is being sequenced in Japan (11). 

Sequencing Strategy 

Sequencing was carried out in sections, with steadily improving technical approaches. The M13 Janus 
shotgun strategy proved to be the most efficient strategy for data collection and closure. It involved initial 
random sequencing at a four- to fivefold redundancy in the Janus vector (12), followed by computerized 
selection of templates to be resequenced from the opposite end, followed by limited primer walking. 

The first 1.92 Mb (13, 14), positions 2,686,777 to 4,639,221 [in base pairs (bp)], was sequenced from our 
overlapping set of 15- to 20-kb MG1655 lambda clones (15) by means of radioactive chemistry and was 
deposited in GenBank between 1992 and 1995. Subsequently, we switched to dye-terminator fluorescence 
sequencing (Applied Biosystems). In addition to greater speed and lower cost, this new technology avoided 
electrophoretic compression artifacts, which, owing to the genome's 50.8% G+C content, occur in practically every 
gene of E. coli. For the next segment (positions 2,475,719 to 2,690,160), we obtained DNA for sequencing 
by the popout plasmid approach (16), in which nonoverlapping segments were excised directly from the 
chromosome in circular form, gel-purified, and shotgunned for sequencing. The largest portion of the 
genome (positions 22,551 to 2,497,976) was sequenced from M13 Janus shotguns prepared from 11 I-Sce I 
fragments of ~250 kb (17). Among the many advantages of the I-Sce I method are the ability to select the 
size of fragment to be shotgunned, elimination of redundant sequencing at the borders between segments, 
and the reliability inherent in sequencing DNA without intermediate cloning steps. Because the DNA is 
never amplified, genes that might be deleterious when present in multicopy form are not subject to 
rearrangements or deletions. Each I-Sce I fragment shotgun contained 15 to 30% random clones from 
elsewhere in the genome, which apparently arose from randomly sheared genomic fragments comigrating 
in the pulsed-field gel. 

The final stages entailed special attention to problem areas. The region between positions 0 and 22,551 did 
not yield a suitable I-Sce I fragment, so three lambda clones were selected to finally complete the genome. 
One of them was found to contain a deletion and had to be finished by shotgun sequencing of a long-range 
polymerase chain reaction (PCR) fragment (18). Other areas of the genome were also resequenced in this 
way. In total, long-range PCR (18) was used to close 36.9 kb of gaps, with amplimers used directly as 
sequencing templates or as source material for shotguns. 

The completed sequence was deposited in GenBank on 16 January 1997; in that sequence 168 ambiguity 
codes reflected uncertainties in the original determination. While this manuscript was in review, additional 
PCR sequencing was undertaken to resolve all of these ambiguous residues, and the affected annotations 
were updated accordingly. 

Annotation 

Annotation is an ongoing task whose goal is to make the genome sequence more useful by correlating it 
with other knowledge. Specifically, we attempted to (i) identify genes, operons, regulatory sites, mobile 
genetic elements, and repetitive sequences in the genome; (ii) assign or suggest functions where possible; 
and (iii) relate the E. coli sequence to other organisms, especially those for which complete genome 
sequences are available. Currently, the annotation includes 4288 actual and proposed protein-coding genes, 
and one-third of these genes are well characterized. Postulation of genes in uncharacterized base sequences 
was surprisingly difficult. They were selected from among the numerous available open reading frames 
(ORFs) on the basis of codon usage statistics, sequence searches versus SWISS-PROT release 34, Link's 
database of NH2-terminal peptide sequences from E. coli, computer prediction of signal peptides, 
[...] communications from colleagues (19). Assignment of NH2 termini posed special problems because 
most ORFs contain multiple in-frame start codons. In the absence of other information, we generally 
selected the ORF with the longest possible NH2-terminus. This method preserves the most coding 
information for analysis, but it may not reflect the situation in vivo. 
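
The start-site convention described above (keep the most upstream in-frame start codon) can be sketched as follows. This is a minimal illustration, not the annotation pipeline itself; the function name and the 0-based coordinates are assumptions, and the start codon set is limited to ATG, GTG, and TTG.

```python
# Hypothetical helper: pick the most upstream in-frame start codon of an ORF.
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TGA", "TAG"}

def longest_orf_start(seq, stop_index):
    """Return the 0-based index of the most upstream in-frame start codon
    for the ORF whose stop codon begins at stop_index, or None if no start
    codon is found before an upstream in-frame stop or the sequence start."""
    seq = seq.upper()
    best = None
    i = stop_index - 3
    while i >= 0:
        codon = seq[i:i + 3]
        if codon in STOP_CODONS:
            break                 # an upstream in-frame stop closes the ORF
        if codon in START_CODONS:
            best = i              # keep walking: a later hit is further upstream
        i -= 3
    return best
```

Choosing the longest candidate keeps the most coding information, at the cost of possibly overextending the true NH2-terminus.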

Functions of previously known E. coli proteins were collected from the GenProtEC (20) and EcoCyc (21) 
databases. The function of new translated sequences was imputed from sequence similarity (22). Each gene 
(including stable RNA genes) in the sequence was assigned a unique numeric identifier beginning with a 
lowercase "b"; when no name has been assigned to a given gene, it is referred to by this number. A specific 
physiological role was assigned if most of the hits were for a specific function such as alcohol 
dehydrogenase, but if the substrates varied among the hits, the common denominator (for example, 
permease or kinase) was assigned to the ORF, substrate specificity unknown. If less specificity was found 
among the hits, a general function was assigned to an ORF when a majority of the hits were for one type of 
function, such as a permease or a class of enzymes. When the functions of the hit sequences were varied 
and there was no solid agreement even for type of function, or when only one sequence was hit, no function 
was assigned to the query ORF and it was counted among the unknowns. 
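
The assignment rules above can be approximated by a simple voting scheme. The sketch below is only an illustration under stated assumptions: each hit is reduced to a hypothetical (function type, specific role) pair, "most" and "majority" are both read as a strict majority, and the two levels of general assignment are merged into one.

```python
from collections import Counter

def assign_function(hits):
    """hits: list of (function_type, specific_role) tuples, one per database hit.
    Returns a function label, or None when the ORF stays among the unknowns."""
    if len(hits) < 2:
        return None                      # a single hit is not enough
    role, n_role = Counter(r for _, r in hits).most_common(1)[0]
    if n_role > len(hits) / 2:           # assumed threshold: strict majority
        return role
    ftype, n_type = Counter(t for t, _ in hits).most_common(1)[0]
    if n_type > len(hits) / 2:
        return f"{ftype}, substrate specificity unknown"
    return None
```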

The average distance between E. coli genes is 118 bp. The 70 intergenic regions larger than 600 bp were 
reevaluated for the presence of ORFs (Geneplot, DNASTAR Inc.) and searched against the entire GenBank 
database for DNA sequence (BLASTN) and protein coding (BLASTX) features (23). Closer inspection 
revealed that 15 of these regions contain previously unannotated ORFs, which in most cases were 
overlooked because of their small size. An additional 11 intergenic regions contain sequence features such 
as long untranslated leader sequences [for example, oppA messenger RNA (mRNA) extends ~500 bp 
upstream of the start codon (24)] or well-characterized control regions [for example, the araFGH operon 
control region (25)]. The remaining 44 large intergenic regions fall into three general classes: putative gene 
regulatory regions, large repetitive sequences, and unknowns. 

Genes separated by more than 600 bp are likely to contain independent regulatory sequences. Twenty-nine 
large intergenic regions contain sequences suggestive of regulatory functions, including 21 with predicted 
regulatory protein binding sites. There are 13 regions between divergently transcribed ORFs, and 11 of 
these have at least one predicted promoter for each ORF (2 have only one predicted promoter). The 16 
regions between ORFs transcribed in the same direction contain at least one predicted promoter for the 
downstream ORF, and several contain a terminator for the upstream ORF. Seven of the large intergenic 
regions, including the largest region overall (1730 bp), consist of repeated sequences such as REP or LDR, 
as described below. Seven intergenic regions larger than 600 bp have no predicted regulatory or coding 
functions. Five of these regions contain sequences that could encode proteins of at least 50 amino acids, 
although codon usage patterns for these ORFs suggest that they are not expressed. It is likely that these 
regions contain additional, as yet undiscovered, functions such as binding sites for additional regulatory 
proteins. 

We searched for promoter and protein binding site sequences upstream of 2436 genes. This includes all 
genes except those that are less than 70 bp from the 3' end of an adjacent gene transcribed in the same 
direction. We limited our search to the 250 bp upstream of the predicted translational start sites because E. 
coli promoters are typically found within this region. We also searched for other potential regulatory sites 
within the 400-bp segments upstream of genes. This search was based on an exhaustive collection of 
known functional sites for 56 transcriptional regulatory proteins. More detail about these methods is 
available elsewhere (26). 
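
Extracting the upstream search window for a gene on either strand of a circular chromosome might look like the sketch below. The function name, the 0-based half-open coordinates, and the '+'/'-' strand labels are assumptions made for illustration; only the 250-bp default comes from the text.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def upstream_region(genome, start, end, strand, length=250):
    """Return `length` bases upstream of a gene's translational start,
    read 5'->3' on the gene's own strand. The gene occupies [start, end)
    on the presented strand; the genome is treated as circular."""
    n = len(genome)
    if strand == "+":
        # upstream lies at lower coordinates; wrap around position 0
        return "".join(genome[(start - length + i) % n] for i in range(length))
    # minus-strand gene: its start is at `end`, upstream lies beyond it
    region = "".join(genome[(end + i) % n] for i in range(length))
    return region.translate(_COMPLEMENT)[::-1]   # reverse complement
```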

The codon adaptation index (CAI) was calculated for each ORF according to the method of Sharp and Li 
[...] expressed genes. CAI is a predictor of the extent of gene expression. This is attributed to correspondence 
with iso-accepting tRNA abundance of E. coli and optimal (intermediate) codon-anticodon interaction 
energy (28). Genes with exceptionally low CAI values may be recent horizontal transfers that still reflect 
the optimal codon usage or mutational spectrum of their previous host (29). We identified clusters of four 
or more adjacent genes with low CAI values (<0.25) and also identified all genes in the lower 10th 
percentile of CAI observed in this genome. 
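
For readers unfamiliar with the index, a compact sketch of a CAI calculation follows: each codon receives a relative adaptiveness w (its count in a reference gene set divided by the count of the most used synonymous codon), and the CAI of a gene is the geometric mean of w over its codons. The reference set, the function names, and the 0.5 pseudo-count for unseen codons are assumptions of this sketch, not details taken from the text.

```python
import math
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product("TCAG", repeat=3), _AA)}

def relative_adaptiveness(reference_genes):
    """Compute w for every codon from a list of in-frame reference sequences."""
    counts = Counter()
    for gene in reference_genes:
        for i in range(0, len(gene) - 2, 3):
            counts[gene[i:i + 3]] += 1
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)
    w = {}
    for aa, codons in families.items():
        if aa == "*" or len(codons) == 1:
            continue                      # skip stops, Met, and Trp
        top = max(counts[c] for c in codons)
        if top == 0:
            continue                      # amino acid absent from the reference
        for c in codons:
            w[c] = max(counts[c], 0.5) / top   # assumed pseudo-count avoids log(0)
    return w

def cai(gene, w):
    """Geometric mean of w over the codons of `gene` that have a w value."""
    logs = [math.log(w[gene[i:i + 3]])
            for i in range(0, len(gene) - 2, 3) if gene[i:i + 3] in w]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")
```

With such a function, genes below a cutoff such as 0.25 can be flagged and scanned for runs of four or more adjacent low-CAI genes.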

The annotated sequence (accession number U00096) is available at the National Center for Biotechnology 
Information (NCBI) through the Entrez Genomes division, GenBank, and the BLAST databases. Our FTP 
site (ftp.genetics.wisc.edu) will maintain an updated version of the sequence as additional annotations or 
corrections are made; the version discussed here is M49. 

Overview of the Sequence 

The genome of E. coli, diagrammed in Fig. 1, consists of 4,639,221 bp of circular duplex DNA (30). Both 
base pair and minute scales are shown; base pair 1 was assigned in an apparently featureless region 
between genes lasT and thrL. Protein-coding genes account for 87.8% of the genome, 0.8% encodes stable 
RNAs, and 0.7% consists of noncoding repeats, leaving ~11% for regulatory and other functions. A radial 
plot shows E. coli's local similarity to sequenced bacteriophage genes. The polar coordinate plot of CAI is 
designed to highlight regions of the genome with unusual codon usage, which may signify recent 
immigration by horizontal transfer. Some gene clusters with low CAI values correspond to known cryptic 
prophages, and others point to possible locations of additional horizontally acquired elements. 

The origin and terminus of replication divide the genome into oppositely replicated halves, which we term 
replichores. Replichore 1, which is replicated clockwise, has the presented strand of E. coli as its leading 
strand; in replichore 2 the complementary strand is the leading one. Many features of E. coli are oriented 
with respect to replication. All seven ribosomal RNA (rRNA) operons, and 53 of 86 tRNA genes, are 
expressed in the direction of replication (Fig. 1). Approximately 55% of protein-coding genes are also 
aligned with the direction of replication, confirming an early observation of Brewer (31). 



Compositional organization of the genome. Several authors (32, 33), in analyzing a variety of systems, 
have commented on base compositional asymmetries correlated with the direction of replication. For E. 
coli, the leading strands of both replichores have significantly (P < 0.001) greater abundance of G 
(26.22%) than its complementary partner C (24.58%) or the alternative pair A (24.52%) or T (24.69%). 
Lobry (33) plotted G-C skew for a 1.6-Mb section of E. coli surrounding the origin and summarized the 
data by codon position and gene direction. We extended this G-C skew analysis to the entire E. coli 
genome (Fig. 2), observing the same sharp transition at the terminus that he reported at the origin. These 
clear trends in base compositional skew apply to genes in both orientations, to intergenic regions, and to all 
codon positions (Table 1), supporting the idea advanced by Lobry, Perna, and Wu (32, 33) that leading and 
lagging strands are subject to differential mutation as the result of asymmetry inherent to the DNA 
replication mechanism. This, combined with natural selection, leads to an observed base distribution that 
depends in part on the mutational pattern and in part on selection. Hence, intergenic regions and third 
positions in E. coli are more skewed than first and second positions, and the net G-rich tendency of the 
leading strand relative to the lagging one is seen in both first and second codon positions, despite strong 
and sometimes opposite codon usage preferences at those positions. 
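
A windowed G-C skew of the kind discussed above can be computed as in this sketch. The formula (G - C)/(G + C) is the commonly used definition and is assumed here, since the text does not spell it out; the function name and window parameters are likewise illustrative.

```python
def gc_skew(seq, window, step=None):
    """Return (window_start, skew) pairs, where skew = (G - C) / (G + C)
    counted on the given strand within each window."""
    step = step or window
    seq = seq.upper()
    result = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        result.append((start, (g - c) / (g + c) if g + c else 0.0))
    return result
```

Under this definition, the leading-strand composition quoted above (26.22% G, 24.58% C) corresponds to a skew of about 0.032; the sign of the windowed skew along the presented strand would be expected to change where the leading strand switches.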



Replication, recombination, and skew. We carried out further analyses of E. coli by constructing a 
reference sequence composed of the leading strands of each replichore concatenated at a novel joint, and 
we examined this sequence for oligonucleotide distribution. The most frequent oligomers in this leading 
strand (for example, octamers; Table 2) form a family containing the trimer CTG, often within the 
pentamer GCTGG, as also noticed by Karlin and co-workers (34). We note that the DnaG primase-binding 
site includes (or is) the sequence CTG, with T being the template for the first base of the RNA primer of 
Okazaki fragments (35). Although there is no direct proof implicating these sequences in discontinuous 
replication, their spacing is consistent with Okazaki fragment sizes and their distribution is skewed toward 
the leading strand, as expected. Although the skews are significant, the most frequent octamers on the 
leading strand are overrepresented on the lagging strand as well. Although leading strand replication is 
highly asymmetric in vitro, both leading and lagging strands are reported to replicate discontinuously in 
vivo (36). The high abundance of these proposed DnaG primase-binding sites on both strands supports a 
model in which both strands are replicated discontinuously. The associated skews imply that the leading 
strand has fewer sites for Okazaki initiation. 

The recombinational hotspot Chi (37, 38), the third most abundant octamer of the leading strand, also 
contains the proposed DnaG primase-binding site. In fact, none of the frequent octamers differs from Chi 
by changes known to inactivate the recombinational activity of Chi (37, 38). Hence, it is possible that other 
members of the family may display Chi activity. As noted earlier (14), the Chi site is markedly skewed 
toward the leading strand. One must skip to the 251st most frequent octamer, GCAGGGCG, to locate a 
higher skew (57%) than that of the Chi site (50%) (Fig. 2, lines 8 and 9). Chi sites are implicated in 
RecBCD-mediated recombination (37), and as part of this process it is supposed that single-stranded DNA 
intermediates having Chi at the 3' end are formed, which then invade the recipient chromosome to form a 
"D loop." This implies the existence of a Chi site on the displaced strand. If the CTG of Chi is also a 
primase binding site, Okazaki initiation at Chi could facilitate strand assimilation by branch migration. 
Kuzminov has recently proposed a role for Chi sites in the recombinational repair of collapsed replication 
forks, which may explain their extreme skew (38), but a secondary role as a primase-binding site may be 
sufficient to explain this bias. 
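
Counting an oligomer on the leading strand and on its complement can be sketched as below. An occurrence on the complementary strand is found by searching the leading strand for the reverse complement of the oligomer. The function names are illustrative, and no particular skew statistic is computed because the text does not define the one it reports; GCTGGTGG is the Chi octamer.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def kmer_counts(seq, k=8):
    """Count all overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def strand_counts(leading, kmer):
    """Return (count on leading strand, count on complementary strand)."""
    counts = kmer_counts(leading, len(kmer))
    return counts[kmer], counts[revcomp(kmer)]
```

Applied to a concatenated leading-strand reference such as the one described above, `kmer_counts(reference).most_common(10)` would list the most frequent octamers.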

Rare tetramer CTAG. It is well known that the palindromic tetramer CTAG is extremely rare in E. coli, 
with an abundance 5% of that predicted from the base composition. Various explanations have been 
offered (39, 40). In Table 3 we have analyzed its distribution in various subsets of the genome. Clearly, the 
rarity of CTAG is most pronounced in protein-coding regions. Its occurrence is considerably higher in 
intergenic DNA, but it is surprisingly abundant in genes coding for structural RNAs, especially in that 
minuscule portion of the genome that codes for tRNAs. Danchin and co-workers (40) have hypothesized 
that CTAG may "kink" DNA and thereby interfere with function. It is also possible that some peculiar 
folding behavior of CUAG in RNA might interfere with mRNA function while having no negative effect 
on stable RNA species. 
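
The comparison of observed abundance with that predicted from base composition can be illustrated with a zero-order model, in which the expected count of a word is the number of positions times the product of its base frequencies. That the authors used exactly this model is an assumption; the function name is illustrative.

```python
from math import prod

def observed_expected(seq, word):
    """Return (observed count, expected count) of `word` in `seq`,
    with the expectation taken from single-base frequencies only."""
    n = len(seq) - len(word) + 1
    observed = sum(1 for i in range(n) if seq[i:i + len(word)] == word)
    freq = {b: seq.count(b) / len(seq) for b in "ACGT"}
    expected = n * prod(freq[b] for b in word)
    return observed, expected
```

Running this separately on coding, intergenic, and stable RNA subsets would give ratios comparable in spirit to those summarized in Table 3.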

Newly Proposed Genes and Previously Mapped Genes 

Six new tRNA genes. In this study we discovered six new tRNA genes. Four of the genes (valZ, lysY, lysZ, 
and lysQ; positions 780,291 to 780,875) are part of the lysT operon and consist of a duplicate of valT and 
three duplicates of lysW. The other two genes form single-gene transcriptional units: asnW (positions 
2,056,049 to 2,056,124) is a duplicate copy of asnT, and ileY (positions 2,783,782 to 2,783,857) is a near 
copy of ileX, differing in a single compensating base pair change in the aminoacyl stem of the tRNA 
(C6G67 in ileX, A6T67 in ileY). 

An operon for degradation of aromatic compounds. Six E. coli enzymes are known to constitute a pathway 
for the degradation of aromatic compounds such as phenylpropionate, but only two of the genes have been 
previously identified, mhpB and mhpE (41). On the basis of similarity searches, we are confident that there 
is an operon that starts with the monooxygenase gene mhpA, followed by the known dioxygenase gene 
mhpB, the hydrolase gene mhpC, the hydratase gene mhpD, the dehydrogenase gene mhpF, and the known 
gene mhpE coding for 4-hydroxy-2-oxovalerate aldolase. All the genes (positions 367,835 to 373,095; 
b0347 to b0352) are in the same order as the enzymes of the pathway. We propose that the next gene 
upstream (positions 366,811 to 367,758; b0346) may be the regulator for this pathway, because this 
sequence is similar to a number of transcriptional regulators. 

A second operon for degradation of aromatic compounds. We have found a previously unrecognized set of 
E. coli genes (positions 2,667,052 to 2,671,269) that resemble Pseudomonas genes for the degradation of 
the aromatic compounds toluene, benzene, and biphenyl (42). The first three genes (b2538 to b2540) 
encode the α and β subunits and the ferredoxin component of the 1,2-dioxygenase that opens the rings and 
oxidizes carbons 1 and 2. The gene encoding the last component of the dioxygenase, the ferredoxin 
reductase (b2542), is separated from the first three genes by another ORF (b2541). The product of this ORF 
resembles the enzyme dihydro-1,2-diol dehydrogenase, which acts on the product of the dioxygenase to 
generate catechol. This proposed operon is preceded by a divergently transcribed ORF (b2537) resembling 
a number of transcriptional regulators, which may be involved in the regulation of the genes. We do not 
know the substrate for this operon, or whether it has enzymes with sufficiently broad specificity to use 
several related substrates. It is also not clear how catechol might be further metabolized. In Pseudomonas 
catechol is normally metabolized by either an ortho or meta pathway, and E. coli has some very distant 
sequence similarities to some of the meta pathway enzymes, especially to the penultimate step. In addition, 
MhpB can metabolize catechol at a slow rate (41). Further research will be needed to determine whether 
this is a physiologically significant pathway, and if so, under what conditions. 



Flagellar operons nearly identical to those of Salmonella. Escherichia coli has an array of 14 flagellar 
synthesis genes (b1070 to b1083), only two of which have been previously reported: flgM and flgL. One 
additional gene is involved with initiation of filament assembly: flgN, which precedes flgM, a negative 
regulator of flagellin synthesis. In the region between flgM and flgL, we identified homologs of the 
Salmonella typhimurium flgA (basal-body P-ring formation), flgB (putative flagellar basal-body formation 
protein), flgC (putative flagellar basal-body formation protein), flgD (basal-body rod modification protein), 
flgE (flagellar hook protein), flgF (putative flagellar basal-body formation protein), flgG (flagellar 
basal-body formation protein), flgH (flagellar L-ring protein precursor), flgI (flagellar P-ring protein 
precursor), flgJ (flagellar protein), and flgK (flagellar hook-associated protein 1) genes. The gene 
arrangement of this cluster (positions 1,128,637 to 1,140,209) is identical to that of the cluster at 26.5 
centisomes on the Salmonella chromosome. In fact, the entire flagellar systems of E. coli and S. 
typhimurium are essentially identical in most respects, with the current organization of genes predating the 
divergence of these two species (43). Two additional genes (b1068 and b1069), preceding the flg genes, 
show strong similarity (81% and 94% identity, respectively, as well as near-equal length) to mviM and 
mviN, two Salmonella virulence factors (43). Homologs of both mviM and mviN also have been identified 
in Haemophilus (3). 

Open reading frames and gene function class assignments. Figure 3 is a detailed graphical presentation of 
the genome showing the arrangement of putative and known genes, operons, promoters, and protein 
binding sites. Of the 4288 ORFs annotated in the sequence, 1853 are previously described genes. (A 
complete listing of E. coli ORFs is available at www.genetics.wisc.edu/ and is likely to change as 
functional data accumulate.) The distribution of start codons is as follows: ATG, 3542; GTG, 612; and 
TTG, 130. There is also one ATT and possibly a CTG (44). The distribution of translation termination 
codons is as follows: TAA, 2705; TGA, 1257; and TAG, 326. We assigned 405 genes with the start codon 
overlapping the preceding stop, distributed as follows: ATGA, 224; TAATG, 98; TGATG, 48; GTGA, 28; 
TAGTG, 4; and TTGA, 3. The most common overlap in phage lambda is also ATGA (45). 
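
The overlap motifs listed above (ATGA, TAATG, and so on) arise when the start codon of one gene shares one or two bases with the stop codon of the preceding gene on the same strand. A small sketch of how such a motif could be read off from gene coordinates follows; the function name and the 0-based coordinate convention are assumptions.

```python
def overlap_motif(genome, upstream_end, downstream_start):
    """Return the merged stop/start motif when the downstream start codon
    overlaps the upstream stop codon, else None.
    upstream_end: index one past the last base of the upstream stop codon.
    downstream_start: index of the first base of the downstream start codon."""
    stop_begin = upstream_end - 3
    start_end = downstream_start + 3
    # the two codons overlap only if each begins before the other ends
    if not (downstream_start < upstream_end and start_end > stop_begin):
        return None
    return genome[min(stop_begin, downstream_start):max(upstream_end, start_end)]
```

The six motif counts in the text sum to the reported total: 224 + 98 + 48 + 28 + 4 + 3 = 405.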

The 4288 ORFs were searched for matches to the Link database of peptides excised from two-dimensional 
gels (19). These searches confirmed the expression of 30 hypothetical ORFs. In addition to the 194 Link 
sequences annotated in SWISS-PROT release 34, our searches identified nine NH2-terminal sequences 
corresponding to dsbA, b2548, gcvT, glpQ, trpB, ydfG, ygaG, ygiN, and yifE. 

The longest ORF encodes a 2383-amino acid protein of unknown function, resembling several bacterial 
attaching and effacing proteins and invasins, virulence factors in pathogenic strains of E. coli and other 
enteric bacteria (46). The average ORF size is 317 amino acids; there are four ORFs in the range 1500 to 
1700 amino acids, 51 in the range 1000 to 1500 amino acids, and 381 that are smaller than 100 amino 
acids. In general, it was difficult to assign small ORFs unless they exhibited typical E. coli codon usage or 
had been characterized biochemically (for example, leader peptides). 

Two complementary catalogs were devised originally to classify functions of E. coli gene products, one for 
broad functions of the gene product (for example, enzyme, regulator, or transport protein) and another for 
specific physiological roles in the cell (47). A simplified composite system was devised to represent E. coli 
gene products ranging from precisely known to loosely attributed functions in Fig. 3. Table 4 summarizes 
the functional class assignments used to classify each ORF. Pending the location of the coding sequences 
for 383 known E. coli proteins that are not yet associated with ORFs, nearly 40% of the ORFs are 
completely uncharacterized. This is similar to the proportion of unassigned ORFs in other recently 
sequenced bacterial genomes: Haemophilus influenzae (43%), Synechocystis sp. (45%), and Mycoplasma 
genitalium (32%) (3). 

The largest well-defined functional group consists of 281 transport and binding proteins, and there are an 
additional 146 putative transport and binding proteins. In contrast, 123 transport proteins have been 
identified in Haemophilus and 34 in Mycoplasma (3). Whether this difference reflects a larger number of 
substrates to transport, greater specificity of particular transporters, or greater redundancy in E. coli is not 
yet clear. In sharp contrast, the number of proteins involved in translation is similar for E. coli (182), [...] 

On the basis of 1827 characterized E. coli proteins, Riley and Labedan (48) described 75 pairs of isozymes, 
or multiple enzymes with identical or nearly identical function. An additional 11 groups of potentially 
redundant enzymes have been identified among the newly sequenced ORFs. Although sequence similarity 
and functional overlap are not synonymous, these highly conserved proteins [point accepted mutations per 
100 residues (PAM) < 110] are likely to carry out the same physiological function. 

We have not yet attempted to represent proteins with multiple roles that depend on physiological 
circumstances. On the basis of our present knowledge, one-fourth of the cell's resources are devoted to 
small-molecule metabolism and about one-eighth to large-molecule metabolism, and at least one-fifth of 
the cell's resources are associated with cell structure and processes. Of course, this distribution may be 
altered when the specific functions of the remaining 40% of the gene products become known. 

Homology between E. coli proteins and the other sequenced genomes. Figure 3 also presents comparisons 
of the 4288 E. coli proteins with data from five other complete genomes (3), representing the three major 
kingdoms. There are two components to the significance of each database hit: the degree of similarity 
between the aligned proteins, and the amounts of the two proteins that are alignable. In Fig. 3, we have 
plotted a simple index that takes both components into account. 

To provide a preliminary estimate of the number of orthologous sequences shared by E. coli and each of 
these other complete genomes, we counted only matches including at least 60% of both proteins in an 
alignment with at least 30% identity. Each protein from another species was permitted only one match to 
an E. coli protein. The largest number of matches to E. coli is found in the Haemophilus influenzae genome 
(1.83 Mb encoding 1703 proteins with 1130 hits to E. coli proteins). Haemophilus, like E. coli, is a 
member of the gamma subdivision proteobacteria, making it the most closely related complete genome 
available for consideration (49). We also compared two additional eubacterial genomes: Synechocystis sp. 
(3.6 Mb, 3168 proteins, 675 hits) and Mycoplasma genitalium (0.58 Mb, 468 proteins, 158 hits). All four 
eubacteria have 111 proteins in common. 
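
The matching criteria above (at least 30% identity over at least 60% of both proteins, with each foreign protein allowed only one E. coli match) can be expressed as a simple filter. The record layout, the field names, and the tie-breaking rule (keep the highest-identity hit) are assumptions of this sketch; the text does not say how the single permitted match was chosen.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    query: str        # E. coli protein identifier
    subject: str      # protein identifier in the other genome
    identity: float   # percent identity over the alignment
    aligned: int      # aligned length in residues
    query_len: int
    subject_len: int

def passes(hit, min_identity=30.0, min_coverage=0.60):
    """True when the alignment covers enough of both proteins at enough identity."""
    return (hit.identity >= min_identity
            and hit.aligned >= min_coverage * hit.query_len
            and hit.aligned >= min_coverage * hit.subject_len)

def putative_orthologs(hits):
    """Keep at most one passing hit per foreign protein (highest identity, assumed)."""
    best = {}
    for h in hits:
        if passes(h) and (h.subject not in best
                          or h.identity > best[h.subject].identity):
            best[h.subject] = h
    return best
```

The same thresholds, applied to E. coli proteins against each other, correspond to the putative paralog definition used later in the text.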

The numbers of matches across kingdoms in the archaeon Methanococcus jannaschii (1.6 Mb, 1738 proteins, 
231 hits) and the eukaryote Saccharomyces cerevisiae (12.1 Mb, 5885 proteins, 254 hits) are remarkably 
similar to each other. However, according to our significance criteria, only 16 proteins are conserved 
among all six taxa; they are largely translation proteins, including seven ribosomal proteins and two 
aminoacyl synthetases. One is classified as a hypothetical ORF in E. coli, Saccharomyces, and 
Methanococcus, but is described as a putative O-sialoglycoprotein endopeptidase in both Haemophilus and 
Mycoplasma on the basis of similarity to a Pasteurella haemolytica protein (50). 

Nearly 60% of E. coli proteins have no match in any other complete genome considered. These may 
represent the subset of proteins specific to enterobacterial or E. coli processes as well as insertion elements 
and phage with restricted host range. The 629 proteins shared exclusively by Haemophilus and E. coli 
include new genes acquired in this lineage. The 292 proteins common to E. coli and just one of the other 
four species are indicative of numerous gene losses over the course of genome evolution. This preliminary 
analysis of similarity among sequences of complete genomes provides many avenues for further study. 

Similarity among E. coli proteins. Also presented in Fig. 3 is a comparison of all the proteins of E. coli 
with each other. These can be divided into families defined by sequence relatedness (5). A paralogous 
family is generally composed of proteins within a single species with similar, though not necessarily 
identical, functions. We define putative paralogs as ORFs that share at least 30% sequence identity over 
more than 60% of their lengths. The similarity index for the best putative paralog of each gene is plotted in 
Fig. 3. Many E. coli proteins (1345) have at least one paralogous sequence in the genome. The relative size of 
a gene family for each protein is also shown in Fig. 3. The largest number of significant hits to a single 
protein (b1917) was 37. This protein is a member of the largest family of paralogous proteins in E. coli, the 
ABC transporter proteins. Riley and Labedan (5) compiled a list of 54 ABC transporters among E. coli 
proteins, and analysis of the proteins from the complete genome reveals an additional 26 members of this 
family. Determination of the number of independent paralogous groups requires a careful examination of 
all the matches to a particular protein, followed by inspection of all hits to proteins contained within the 
initial list of matches (5), and will require further analysis. 
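
The paralog criterion stated above (at least 30% identity over more than 60% of the lengths) can be sketched as a simple filter on pairwise alignment summaries. This is only an illustration: the PairwiseHit structure and its field names are hypothetical, and the choice to require the coverage threshold for both proteins is an assumption, since the text does not specify how coverage was measured.

```python
from dataclasses import dataclass


@dataclass
class PairwiseHit:
    """Summary of one pairwise protein alignment (hypothetical structure)."""
    query_len: int    # length of the query protein, in residues
    subject_len: int  # length of the matched protein, in residues
    aligned_len: int  # number of aligned positions
    identities: int   # identical residues within the alignment


def is_putative_paralog(hit, min_identity=0.30, min_coverage=0.60):
    """Apply the thresholds quoted in the text to one alignment summary."""
    if hit.aligned_len == 0:
        return False
    identity = hit.identities / hit.aligned_len
    # Assumption: coverage is checked against each protein separately.
    coverage_query = hit.aligned_len / hit.query_len
    coverage_subject = hit.aligned_len / hit.subject_len
    return (identity >= min_identity
            and coverage_query > min_coverage
            and coverage_subject > min_coverage)
```

Under these assumptions, a 250-residue alignment with 100 identities between proteins of 300 and 320 residues passes the filter, whereas the same alignment against a 900-residue protein fails on coverage.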

Many proteins are members of paralogous gene families and have significant matches in other species. It 
will be difficult, if not impossible, to unambiguously determine the relation between similar genes in 
different species when the level of divergence between orthologous genes approaches the level of 
divergence among paralogs within a species. The genes in all genomes are derived from a set of unique 
ancestral genes present in a progenitor of all extant organisms. Upon duplication of an ancestral gene, 
copies of the gene may be subsequently lost through natural selection or simply by a neutral stochastic 
process. Alternately, the copies may be retained as redundant systems for executing the original biological 
function, or they may diverge, with one or both copies giving rise to a novel function. This process of 
duplication and divergence, along with the occasional transfer of genes between strains and species, gives 
rise to the present contents of a genome (51). Characterization of all E. coli paralogous groups and 
comparison with groups from other species will allow examination of the evolutionary events surrounding 
protein diversification. 

Operons, promoters, and protein binding sites. Operons, promoters, and regulatory protein binding sites are 
shown in Fig. 3. A total of 2584 predicted and known operons are represented. Of 2192 predicted operons, 
a surprisingly high 73% have only one gene, 16.6% have two genes, 4.6% have three genes, and 6% have 
four or more genes. All of them have at least one promoter, either known or predicted. Of 2405 operon 
regions with predicted promoters, 68% contain one promoter, 20% contain two promoters, and 12% 
contain three or more promoters. Regulatory sites are described in 603 regions corresponding to 16% of 
operon regions and 10% of interoperonic regions. We estimate that our search included representatives of 
15 to 25% of the total number of different regulatory binding proteins in E. coli, including sites that are 
recognized by global regulators of transcription (for example, sites bound by the cyclic AMP receptor 
protein, CRP). Within the regions with predicted sites, 89.2% are regulated by one protein, 8.4% by two 
proteins, and 2.4% by three or more proteins. In 81.2% of these regions only one site was found, 12.2% 
have two sites, and 6.6% have three or more sites. These numbers are more or less consistent with the 
distribution of regulatory sites among a set of promoters where transcriptional regulation has been well 
studied. In this collection of 132 promoters, 73% are regulated by one protein, and 43% contain only one 
site for the binding of a regulator (52). A number of E. coli genes are part of known operons (Fig. 3, red 
arrows). 
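
As a rough aid to reading the percentages above, the short sketch below converts them into approximate counts. The totals (2192 predicted operons, 2405 operon regions with predicted promoters) and the percentages are those quoted in the text; the function name and category labels are illustrative, and because the published percentages are rounded, the resulting counts are estimates rather than reported values.

```python
def percent_to_counts(total, percentages):
    """Convert rounded percentages of a total into approximate integer counts."""
    return {label: round(total * pct / 100) for label, pct in percentages.items()}


# Percentages as quoted in the text.
GENES_PER_OPERON = {"1 gene": 73.0, "2 genes": 16.6, "3 genes": 4.6, "4+ genes": 6.0}
PROMOTERS_PER_REGION = {"1 promoter": 68.0, "2 promoters": 20.0, "3+ promoters": 12.0}

operon_counts = percent_to_counts(2192, GENES_PER_OPERON)
promoter_counts = percent_to_counts(2405, PROMOTERS_PER_REGION)
```

The operon percentages sum to slightly more than 100 because of rounding, so the estimated counts sum to slightly more than 2192.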

Repeated sequences. A number of repeated sequences have been characterized in the E. coli genome (53). 
The number and distribution of these sequences in the whole genome are summarized in Fig. 2. The largest 
repeated sequences in E. coli K-12 are the five Rhs elements (all previously described), which are 5.7 to 
9.6 kb in length and together comprise 0.8% of the genome. They have no known function, although strain 
comparisons suggest they may be mobile elements. The ~40-bp palindromic sequences variously referred 
to as REP, BIME, or PU constitute the largest class of repeats. They are often found as tandem copies, 
alternating in orientation, in complexes called REP elements. We have located 581 such sequences, in 314 
REP elements containing from 1 to 12 tandem copies (see also Fig. 1). These elements account for 0.54% 
of the genome and are of unknown origin and function. These can be subdivided into distinct classes, as 
described by Bachellier et al. (53). Of the other known small dispersed repeats, we find four new IRU (or 
ERIC) elements, for a total of 19; four new copies of Box C, for a total of 33; and only the previously 
described six copies of RSA. The distribution of some of these repeated sequences may not be totally 
random; for example, Box C is absent over a 1-Mbp span in replichore 2. 

Another repeated sequence found in the E. coli genome is the Ter sequence, which acts as a one-way gate 
or valve to block the progression of the DNA replication fork such that replication starting from the origin 
is prevented from progressing beyond the terminus marked by the dif site (54). Francois et al. (55) 
identified 10 different chromosomal fragments with homology to an oligomeric TerA probe, but only seven 
Ter sequences (TerA through TerG) have been identified to date. We found two new copies of the 11-bp 
Ter core sequence TGTTGTAACTA, both of which are located and oriented as expected relative to dif. 
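
A search for the Ter core on both strands can be sketched as follows. This is a minimal illustration assuming exact string matching against a genome held in memory as a plain string; the text does not describe the search procedure actually used, and the function names here are hypothetical.

```python
TER_CORE = "TGTTGTAACTA"  # 11-bp core sequence quoted in the text

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_motif_both_strands(genome, motif=TER_CORE):
    """Return sorted (position, strand) pairs for exact motif matches.

    Positions are 1-based coordinates of the leftmost matching base on the
    forward strand; strand is "+" for the motif itself and "-" for its
    reverse complement.
    """
    genome = genome.upper()
    hits = []
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        start = genome.find(pattern)
        while start != -1:
            hits.append((start + 1, strand))
            start = genome.find(pattern, start + 1)  # allow overlapping hits
    return sorted(hits)
```

Reporting the strand alongside the position matters here because a Ter site blocks replication forks in one direction only, so its orientation relative to dif is part of what makes a candidate site plausible.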

The sequence named LDR (11) occurs as three tandem copies at positions 1,268,308 to 1,269,848; a lone 
fourth copy, shorter and diverged from the consensus of the other copies, is located at positions 3,697,525 
to 3,697,888. In the region between positions 2,875,665 and 2,902,430, a 29-bp sequence called the iap 
repeat is found in three clusters of 14, 2, and 7 copies, for a total of 23 copies (53, 56). No additional 
copies of either of these sequences are found in the rest of the genome. 

Insertion sequences. The chromosome of E. coli K-12 contains a number of autonomously transposable 
elements that are implicated in the generation of many spontaneous mutations, not only by insertional 
inactivation, but also by deletions, duplications, and inversions. Estimates have been made as to the IS 
element set present in E. coli K-12 when originally isolated (57). The IS elements' map positions are shown 
in Fig. 2. There are two multicomponent clusters. At positions 269,430 to 271,751, there is an IS911-related 
sequence (65% match), which we term IS911A, interrupted by a copy of IS30. At positions 4,504,683 to 
4,507,369, there is a more faithful copy of IS911 (designated IS911B), which is also interrupted by a copy 
of IS30 as well as by a piece of IS600. This is the only IS600-related sequence in the genome. We did not 
find the copy of IS629 that had been suspected from hybridization studies (58). 

Cryptic prophage and phage remnants. As originally isolated, E. coli K-12 carried bacteriophage lambda 
plus the defective lambdoid prophages DLP12, Rac, and Qin, the element e14, and the recently described 
CP4-57 (59). Defective, or cryptic, prophages have lost some functions essential for lytic growth and the 
production of infectious particles, but still retain other functional phage genes. They can rescue mutations 
in related infecting bacteriophages by recombining with them to generate viable hybrids. Figure 2 shows a 
histogram plot presenting all sequence matches to the phage proteins in SWISSPROT. In addition to 
clarifying the structure of the known prophages, we identified three new cryptic prophages. Moreover, we 
found numerous instances of isolated genes that are similar to bacteriophage genes. We call these single 
genes "phage remnants" to distinguish them from the larger cryptic prophages. Although this implies a 
phage origin (the last vestiges of a cryptic prophage ravaged by deletions), these genes may actually be 
homologs encoded by both a bacteriophage and its host, with no ready indication as to which genome was 
the original carrier. 

We determined the precise endpoints of e14 in MG1655 (positions 1,195,432 and 1,210,646), including 
(GenBank accession numbers M19693 and M19683). The 1829-bp Pin invertible P-region of e14 is in the 
(-) orientation in this sequence. The precise boundaries of the other lambdoid prophage remain to be 
annotated. The "cryptic P4" phage CP4-57 (59) is located at 57 minutes, where it is inserted into the stable 
RNA gene ssrA. The junction sequences (59) allowed us to identify the extended attL and attR sequences 
and to define the endpoints of the prophage (positions 2,753,956 and 2,776,007); our earlier report 
(GenBank accession number U36840) that attR was deleted in MG1655 was a misinterpretation. 

We have discovered two new cryptic prophages, seemingly related to CP4-57, which we name CP4-6 and 
CP4-44 after their minute positions. The three CP4 prophages are organized similarly and encode several 
similar proteins, although they do not share the same attachment sites. We infer that CP4-6 is integrated 
into tRNA gene thrW (60) because the 3' end of thrW is duplicated 34,242 bp downstream adjacent to 
b0281, a homolog of several integrases. This prophage (positions 262,122 to 296,489) includes argF, a 
known "duplicate" gene in the arginine biosynthesis pathway that has been suggested to have been acquired 
through a transposition event (61). It also includes the IS911A complex, a partial IS30 copy, two copies of 
IS1, and one copy of IS5. CP4-44 is less well defined (approximate endpoints at positions 2,064,181 and 
2,077,053) and we suspect that insertion of the IS5 at its left end may have been accompanied by a deletion 
of part of the prophage; although it shares other ORFs with CP4-6 and CP4-57, it has no candidate 
integrase or associated direct repeats that might be att sites. 

A third new cryptic prophage is located in the eut operon. Its presumptive integrase (b2442) resembles that 
of phiR-73, Sf6, and the CP4 family, but no other ORFs suggest its inclusion in the CP4 group. The 
endpoints of the element (positions 2,556,711 and 2,563,508) were defined by comparison with the 
sequence of Salmonella typhimurium, from which the element is missing (62). The 8-bp direct repeat 
TCAGGAAG at the ends is present as a single copy in Salmonella. The W3110 sequence from the 
Japanese group (http://mol.genes.nig.ac.jp/ecoli/) is missing this element, which, in light of the K-12 
pedigree, suggests that this element is able to excise. 
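
The endpoint coordinates given for the prophages in this section can be turned into element sizes with simple interval arithmetic. The sketch below assumes the quoted positions are 1-based and inclusive, which the text does not state explicitly; the dictionary labels are descriptive names chosen here, not designations from the annotation.

```python
# Endpoint coordinates as quoted in the text (assumed 1-based, inclusive).
ELEMENTS = {
    "e14": (1_195_432, 1_210_646),
    "CP4-57": (2_753_956, 2_776_007),
    "CP4-6": (262_122, 296_489),
    "CP4-44 (approximate)": (2_064_181, 2_077_053),
    "prophage in eut operon": (2_556_711, 2_563_508),
}

GENOME_LENGTH = 4_639_221  # bp, total sequence length reported for K-12


def span_length(start, end):
    """Length in bp of an inclusive coordinate interval."""
    return end - start + 1


def summarize(elements=ELEMENTS, genome_length=GENOME_LENGTH):
    """Return (name, length_bp, percent_of_genome) for each element."""
    rows = []
    for name, (start, end) in elements.items():
        length = span_length(start, end)
        rows.append((name, length, 100.0 * length / genome_length))
    return rows
```

Under the inclusive-coordinate assumption, CP4-6 spans 34,368 bp, which is in line with the 34,242-bp distance reported between thrW and the duplicated copy of its 3' end.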



Conclusion 



Although the determination of the complete E. coli sequence has required almost 6 years, this represents 
only the beginning of our understanding. Further research will be required to determine the precise 
functions for all of the genes by global transcriptional analysis, phenotypic analysis of mutants, and 
analysis of biochemical and catalytic properties of the expressed proteins. Another fruitful avenue for 
exploration will lie in whole genome comparisons, both with related pathogens to identify those genes that 
confer unique detrimental or beneficial properties, and with other microbial genomes to ascertain 
evolutionary relations. 




Fig. 3 